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Abstract 



We consider a Gaussian process formulation of the multiple kernel learning prob- 
lem. The goal is to select the convex combination of kernel matrices that best 
explains the data and by doing so improve the generalisation on unseen data. 
Sparsity in the kernel weights is obtained by adopting a hierarchical Bayesian 
approach: Gaussian process priors are imposed over the latent functions and gen- 
eralised inverse Gaussians on their associated weights. This construction is equiv- 
alent to imposing a product of heavy-tailed process priors over function space. A 
variational inference algorithm is derived for regression and binary classification. 

1 Introduction 

Kernel-based methods are well-established tools for supervised learning, allowing to perform var- 
ious tasks, such as regression or binary classification, with linear and non-linear predictors. Like 
most statistical models, kernel-based methods can be considered within two frameworks: in the fre- 
quentist approach, estimators are obtained by minimizing a regularized empirical risk, leading e.g. 
to kernel ridge regression or the support vector machine IISTC04irSBS00l : in the Bayesian approach, 
Gaussian processes (GPs) provide a Bayesian interpretation to kernel-based methods |RW06|, with 
the potential to learn the kernel parameters from the data without having to use cross-validation. 

Crucial to the predictive performance of kernel methods is the choice of the kernel function. In the 
Bayesian setting, the kernel function (often called covariance function) determines the correlations 
between the predictions we make. Assuming that the predictor's smoothness is fully specified by 
these correlations can be formalised by a Gaussian process imposed over function space. Techniques 
based on automatic relevance determination have been successful at learning the parameters of ker- 
nel functions such as the individual length scales of the squarred exponential kernel |RW06|. In the 
frequentist setting, a specific parameterization of kernel functions has led to a significant amount of 
work, namely positive linear combination of pre-define d kernel fu nctions (or kernel matrices), lead- 
ing to the multiple kernel learning (MKL) framework llLCB+041 IBLJ0 41 . The first contribution of 
this paper is to propose a Gaussian process (GP) formulation of the multiple kernel learning frame- 
work, which we refer to as multiple Gaussian process (MGP) models. Its second contribution is to 
provide a framework to consider all -norms at once and to determine /ram data whether we should 
use a sparsity-inducing prior or not. Currently, there is no consensus in the frequentist community 
on how to choose the type of regularization. In practice, however, the choice of the regulariser leads 
to solutions of very different kinds. For example, when considering an ^i-norm a sparse solution 
will be obtained whether or not it is supported by the data. Obviously, if all kernels are important 
for prediction, this will be detrimental |,GN09i . 

2 Multiple Gaussian process model for regression 

Let {yn}n=i t'e the set of noisy targets and {x„i, . . . , x„p}^^]^ the set of features, which are as- 
sumed to be non-random column vectors. We consider a weighted linear model of the P feature 
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vectors with i.i.d. Gaussian noise: 

y|Xi, . . . ,Xp, wi, . ...wp^ 7V(X;pXpWp,r"^lAr), (1) 

where y = (yi, . . . , j/at)^ and Ijv is the identity matrix of dimension N. The weights associated to 
the feature matrix Xp e M^^-^p are denoted by Wp e and the residual precision by r. 

The case of interest is the one where the weight vectors are sparse, i.e., many of their elements 
are (nearly) zero. However, we do not know a priori the degree of sparsity. From a Bayesian 
perspective, the spike and slab prior is the golden standard for inducing sparsity. Here, follow a 
different approach and choose the prior p(wp) to be a Gaussian scale mixture |AM74| centred at 
zero. In effect we approximate the spike and slab prior by a continuous prior which favours sparse 
solutions. Although zero probability mass is put on exact zero values, the use of heavy-tailed priors 
allows us to infer large, as well as quasi zero values for the kernel weights. 

Formally, we impose the following product of zero-mean Gaussian scale mixtures on the weights: 

w|7^npAA(0,7p-V)' 7^npAA-i(c.,x», (2) 

where 7 — (71, ... , 7p)^ is the vector of unobserved scale variables on which independent gen- 
eralised inverse Gaussian densities (see Appendix [A]) are imposed. The marginal p{w) is then a 
symmetric generalised hyperbolic density lHu05il . which has fat tails compared to the Gaussian. 
This family contains the Student-t, the Laplace, the Gamma-variance and Jeffrey's as special cases. 

Given this probabilistic model, one can integrate out w, leading to a closed form expression for the 
marginal density of the observations: 

y|7^AA(0,Ep7p"'Kp + r-il^), (3) 

where Kp = XpXj G 'R^^^ is the kernel matrix associated with the p-th feature matrix. Clearly, 
the marginal density is well-defined for any set of valid kernel matrices since 7p > for all p. 

The linear additive model defined in ([T]) corresponds to the weight-space view representation of the 
multiple Gaussian process model (MGP). From ([3]l, however, we see that the marginal density only 
depends on the linear kernel matrices {Kp}^^]^. Hence, the model can be generalised to a non-linear 
additive model by replacing these linear kernel matrices by non-linear ones. This new representation 
corresponds to the multiple /Mncf/on-ipace view representation. 

Let {/p(-)}^=i be a set of P latent functions on which we impose scaled Gaussian process priors: 

fp{-)\lp-GV{On;'kp{;-)), (4) 

for all p. The functions {kp{-, ■)}p^i are covariance functions, which are also valid kernel functions 
IIRW06I . Again we consider i.i.d. Gaussian noise, but assume {yn}n=i ^re noisy observations of a 
sum of P latent function values fp G M^. The likelihood function and the MGP prior are given by 

y|f , r ^ AA(Epfp, t-^In) - AA(Mf , t-^In), (5) 
f|7^npA^(0,7p"'Kp)=AA(0,K), (6) 

where = {ij , . . . ,iJ),M ^ (lJ(g)lAr) and K = diag{7f ^Ki, . . . , 7p^Kp}. Vector Ip is the 
unit vector of dimension P and the operator (g) denotes the Kronecker product. The prior on 7 is still 
given by (j2|i and the marginal p(y|7) has the same form as in the weight-space view representation. 

The MGP model corresponds to imposing P independent non-Gaussian process priors over function 
space. If we condition on the corresponding scale variable, any finite subset of latent function 
values is distributed according to a multivariate Gaussian marginal. For any of these marginals one 
can integrate out the scale variable, such that any finite set of latent function values is distributed 
according to a product of P independent multivariate Gaussian scale mixture densities: 



i-H M{0,j-'Kp)p{^p)djp^l[ ^-^^ ^ — /. (7) 

P P (^v^(0+f/K-ifp)/xj 

where K^{-) is the modified Bessel function of the second kind. Hence, the prior measure imposed 
over function space is a heavy-tailed one, known as the generalised hyperbolic measure IIBN77I 
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IHu05L The Gaussian process is recovered for cj — > cx) and the symmetric multivariate zero-mean 
hyperbolic process is obtained for cj = — 1. Other special cases include the multivariate Gamma- 
variance process (oj < Q and <j) = 0), the multivariate Laplace process (w = — 1 and (p = 0), the 
multivariate Student-i process (w > and x — ^) ™d the multivariate Cauchy process (oj — 1/2 
and X — 0)- 

The most straightforward approach for the estimation of 7 is to use type II maximum a postriori (or 
type II maximum likelihood in absence of prior on 7 as adopted in iKGUDlOl ). The optimisation 
can be performed using standard nonlinear optimisation tools, but the regulariser needs to be chosen 
in advance. Instead, we turn our attention to the inference problem of these parameters from data. 
We view 7 as a latent variable and the desired level of sparsity is learnt from the data by optimising 
the hyperparameters by type II maximum UkeUhood (ML). 

3 Variational inference with type II maximum likelihood 

We follow a mean field approach ||Bea031IBis06l| . In order to find an analytically tractable solution, 
the posterior over the latent function values f and the scale vector 7 is assumed to factorise given 
the data, that is q{f, 7) — q{i) Yip <z(7p)- It can be shown that the variational posteriors maximising 
the negative variational free energy (a lower bound to the log-marginal likelihood) are given by 
q{i) — N{p,, S) and 9(7) = J^^ N~ {ujp, Xp, 4'p)- The parameters are defined as 

M-rSM^y, S= (K-i+rM^M)"', = + 7V/2, Xv = X. 0p = + (fp^K^^fp), 
where K = diag{(7i)-iKi, . . . , (7p)-iKp}. 

The predictive MGP is obtained by assuming that (7(7) is peaked around its mean such that 
q{f{x.)) « (7(/(x)| (7)). The true predictive density can then be approximated by the following 
analytically tractable integral: 

y(x)|y^ yp(y(x)|/(x))g(/(x))d/(x)=gP(m(x),«(x) + r-i). (8) 

where 

"iW = Ep(7p>"'kp(xp)TB-V, (9) 

«(x) = Ep(7p>"'fcp(xp,Xp)-EpE,(7p)-'(7«)"'k,(Xp)TB-ik,(x,)}, (10) 

with B — J2r{'yr)^^^r + T^^^N- From these expression we see that the posterior mean and 
variance have the same form as in standard GP regression; the kernel is simply replaced by a convex 
combination of kernels. Note, moreover, that the expression m(x) has the same form as the one we 
would obtain with a frequentist method such as kernel ridge regression. 

The ML II updates for the hyperparameters are obtained by solving the following expressions (which 
are simple root finding equations, with unique solutions, hence easily solved by binary search): 

: Pln^f-P^^^^^^^M) +Ep(ln7p> = 0, (11) 

^^T'-f\/f^"(^) + 5Ep(7p-^)=0, (12) 

^ ■■ - l^fiRUVx^) + \ Ephp) = 0, (13) 

where Rcj {■) = K^+i {■)/ {■). These updates are obtained by direct maximising of the variational 
bound. The update for r is obtained in the same manner. 

4 MGP for classification 

We restrict ourselves to binary classification and consider a scaled probit model in which the likeli- 
hood is derived from the Gaussian cumulative density. A probit model is equivalent to a Gaussian 
noise and a step function hkelihood IIAC93||OW001 . 
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Table 1: Average root mean square error for toy regression data (lower is better). The multiple 
Laplace process performs worse than the Student-t and the Gamma-variance process when the gen- 
erating process is sparse. ARD performs poorly when the generating process is not sparse. In the 
case of MKL the prior choice of the regulariser leads to more sensitivity to model misspecifications. 

Number of active kernels 1 out of 10 3 out of 10 10 out of 10 



Multiple Student-t 

Multiple Laplace 

Multple Gamma-variance 

ARD 

MKL ii 

MKL £2 

MKL 4/3 



.033 (±.027) .067 

.034 (±.028) .076 

.033 (±.027) .067 

.033 (±.027) .066 

.037 (±.032) .066 

.830 (±.655) .831 

.097 (±.062) .233 



(±.032) .719 (±.221) 

(±.035) .704 (±.204) 

(±.032) .719 (±.223) 

(±.031) .746 (±.223) 

(±.030) .720 (±.203) 

(±.399) .762 (±.251) 

(±.098) .719 (±.238) 



Let {tn}n=i be the class labels, with t„ G { — 1, +1} for all n. The likelihood (jsjl is replaced by 

My-Unlitnyn), y |f ^ AA(E/p, r-ij^), (14) 

where I{z) = Ifor^; ^ and otherwise. 

As in the case of regression, we consider a mean field approximation and assumes the posterior is of 
the form q{y)q{i)q{j). We further assume the variational posterior g(y) is a product of truncated 
Gaussians (see Appendix [B]i: 

^(y) « UnHtnynWi'^nAn) = {Ut^^+l J^+i^^n, K)) {Ilt^ = -1 J^- {f^n, Xn)) , (15) 

where — J^pifpi^np)) and A„ — 1/t. The posterior mean and the posterior covariance of f are 
unchanged, except that y is replaced by i'±. The elements of v± are defined in (20 1. The posterior 
(7(7) and the updates for the hyperparameters are identical to the ones in MGP regression. 

In Bayesian classification the label with highest probability P(i(x)|t) is selected. Since an exact 
computation is analytically intractable, we assume the posteriors g(y) and (7(7) are highly peaked 
around their mean leading to the following classification rule: 

P(t(x) = ±l|t) « P(<(x) = ±l|t, z.±, (7)) = $( ± m(x)/v/«(x) + T-i), (16) 

where m(x) and i;(x) are as before with y replaced by Deciding whether the label is —1 or +1 
is equivalent to using the sign of to(x) as the decision rule. 



5 Discussion 

We compare frequentist and Bayesian approaches to kernel combination. We demonstrate the flexi- 
bility and the performance of the MGP models on the following two detat sets: 

Toy regression data. We generate random functions from the Hilbert spaces induced by 10 Lapla- 
cian kernels and add Gaussian i.i.d. noise; we show results on three different settings: a sparse prob- 
lem, where only one kernel is used to generate the response, a semi-sparse problem with 3 functions 
are used and a non-sparse problem where all ten functions are active. Table [T] compares several 
MGP models and several cross-validated MKL models with fixed regularisation norms. Fig.[2]in the 
Appendix shows that the hierarchical Bayesian approach is able to adapt to the sparsity of the data. 

Flowers data set{^ Due to a lack a space we do not describe the data and the features, but only 
mention it is a standard MKL benchmark for multi-class image classification. For each of the 102 
flower classes we learn a one-versus-all classifier. Fig. [T] shows the ROC curve for two classes 
when considering a Student-t process, for which we obtained an average AUG = .948 ± .057. The 
average AUG for Gamma-variance process and ARD are respectively given by = .957 ± .050 and 
.947 ± .058. All are better than state-of-flie-art MKL results I1MEZ08L 

'www.robots.ox.ac.uk/ vgg/data/flowers/102/ 



4 



ROC Curve 

AUG . 0.993 
Conv Hull 
AUG = 0.994 
O Acc(O) . 0.00455 
□ Max Acc 0.996 



0.2 



0.4 



Specificity 




ROC Curve 

AUC = 0.745 
Conv Huil 
AUC = 0.759 
O Acc(0).0.0115 
□ IVlax Acc 0.988 



0.2 



0.4 



Specificity 



(a) Typical roc curve (class 102). 



(b) Worse roc curve (class 96). 



Figure 1: Flowers data. ROC curves obtained for Student-t process one- versus -all classification for 
two flower classes. The ROC curves obtained for Gamma-variance process are slightly better. 
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A Generalised inverse Gaussian density 

The generalised inverse Gaussian density is defined as follows |j0r82| : 



(17) 



where x > and Kui{-) is the modified Bessel function of the second kind with index a; £ K. Depending on 
the value taken by uj, we have the following constraints on x and 0: 



cj > 
ui = 
cj < 



X > 0, > 0, 
X > 0, <?!> > 0, 
X>0, 0^0. 



Let us define Rui{-) — Kui+i{-)/ Kuj{-). The following expectations are useful: 

(->-,&4y^), fin-Uy^l (ln.}=lrr./f+ ^^"^;(^\ (18) 

y (p y X V ''^ 

When X = and uj > 0, the generalised inverse Gaussian density reduces to the Gamma density. When (p = 
and cj < 0, it reduces to the inverse Gamma density. The expectations simplify also. 



B Truncated Gaussian density 

The (positive/negative) truncated Gaussian density is defined as follows: 

where $(a) = /"^ A/'(0, l)dz is the cumulative density of the unit Gaussian. 

Let x± ~ N±{jx, rj"^). The mean and variance are given by 

{x±) = ^±a^A/±(0i^i,(7^), 
{(x± - {x±)f) = T aVAA±(0|M, cy^) - a^N±{Q\^^,a^f. 



(19) 



(20) 
(21) 



C Example of the inferred kernel weights 



T T 



23456789 10 



23456789 10 
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i t t + 



23456789 10 



(a) 1 active kernel. 



(b) 3 active kernels. 



(c) 10 active kernels. 



Figure 2: Toy regression data. Shown are the box-and-whisker plots of the expected weight for each 
kernel when consider a Gamma prior (Student-t process). Other generalised inverse Gaussian priors 
perform as well. The MGP model was run on hundred different data set realisations. 
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